Introduction

This project explores a dataset containing trip data for the Ford GoBike bike-sharing system in the greater San Francisco Bay Area.

What is the structure of your dataset?

There are 183,412 Ford GoBike trips in the dataset, with 16 fields (duration_sec, start_time, end_time, start_station_id, start_station_name, start_station_latitude, start_station_longitude, end_station_id, end_station_name, end_station_latitude, end_station_longitude, bike_id, user_type, member_birth_year, member_gender, bike_share_for_all_trip).

The main feature(s) of interest in the dataset

My main interest is trip duration and how it relates to the other fields in the dataset.

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

I expect trip duration to depend on the day of the week, as I would expect more bike trips on weekends than on weekdays. Similarly, I would expect males to take more and longer trips than females, as biking requires strength.

Data Wrangling

Univariate Analysis

I start by exploring the duration of trips, the age of riders, trips per day, the period of the day, and stations.

The plot doesn't show the real distribution, as it's affected by the very low and very high values in the data.

The duration of most trips is less than 2000 seconds. Most trips last between 500 and 1000 seconds, with a slight peak at 600 seconds. As mentioned before, some trips take more than 70000 seconds (outliers).

Most Ford GoBike riders are younger, with the majority between the ages of 30 and 40. There are also riders recorded as being above 100 years old. This is quite surprising, as biking requires some physical strength.

Since all the trips in the dataset happened in February, we won't analyse by month.

We will proceed to analyse trips by day.

View trips by day

From our plot, we see that most trips started and ended on the same day. Thursday had the highest number of trips, for both the start day and the end day.
Only Monday has the same number of trips by start day and by end day.
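
The comparison of start days and end days can be sketched as below. This is only a minimal illustration on made-up rows, assuming start_time and end_time have been parsed as datetimes; the names day_counts and same_day_share are mine, not from the original analysis.

```python
import pandas as pd

# Made-up rows for illustration; the real data has the same two columns.
trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-02-04 08:00", "2019-02-04 23:50", "2019-02-07 17:30",
    ]),
    "end_time": pd.to_datetime([
        "2019-02-04 08:15", "2019-02-05 00:10", "2019-02-07 17:45",
    ]),
})

# Count trips per weekday, once by start day and once by end day.
# Days missing from one column are filled with 0 after alignment.
day_counts = pd.DataFrame({
    "start_day": trips["start_time"].dt.day_name().value_counts(),
    "end_day": trips["end_time"].dt.day_name().value_counts(),
}).fillna(0).astype(int)

# Share of trips that start and end on the same calendar date.
same_day_share = (
    trips["start_time"].dt.normalize() == trips["end_time"].dt.normalize()
).mean()

print(day_counts)
print(round(same_day_share, 3))
```

A trip that starts just before midnight is counted under different days in the two columns, which is how the start-day and end-day totals can differ.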

Our analysis shows that most trips start in the morning or the afternoon, with little difference between the two periods. Together they account for over 76% of the trips. Fewer trips start at night: only about 23%.

Higher ride frequencies are seen in the morning (7th, 8th and 9th hours). Similarly, there is a high frequency of rides from the 16th to the 18th hour.
These increases may be due to workers commuting to and from work. This corroborates the previous analysis showing more trips in the morning and afternoon.
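
A rough sketch of the hourly count follows, using made-up start times. The commute_hours list simply mirrors the hours named above, and the variable names are illustrative assumptions.

```python
import pandas as pd

# Made-up start times for illustration only.
start_time = pd.Series(pd.to_datetime([
    "2019-02-05 08:05", "2019-02-05 08:40", "2019-02-05 13:15",
    "2019-02-05 17:20", "2019-02-05 22:45",
]))

# Hour of day (0-23) in which each trip started.
start_hour = start_time.dt.hour

# Number of trips per starting hour, ordered by hour.
trips_per_hour = start_hour.value_counts().sort_index()

# Share of trips starting in the hours described as busiest in the text.
commute_hours = [7, 8, 9, 16, 17, 18]
commute_share = start_hour.isin(commute_hours).mean()

print(trips_per_hour)
print(commute_share)
```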

San Francisco Caltrain Station 2, Market St at 10th St and Montgomery St BART Station are the most active stations.
They rank highly among stations for both trip starts and trip ends.

Approximately 91% of Ford GoBike users are subscribers and only 9% are customers.

A low percentage of riders are classified as other (2%). As expected, a higher percentage of riders are male (75%), against the 23% who are female.
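
Percentages like the ones above can be obtained from normalized value counts. The helper below is a hypothetical sketch (percentage_share is my own name), applied to a made-up column rather than the real data.

```python
import pandas as pd

def percentage_share(column: pd.Series) -> pd.Series:
    """Return each category's share of the column, in percent (1 decimal)."""
    return column.value_counts(normalize=True).mul(100).round(1)

# Made-up user_type column: 9 subscribers and 1 customer.
user_type = pd.Series(["Subscriber"] * 9 + ["Customer"])

share = percentage_share(user_type)
print(share)
```

The same helper would work for member_gender. Note that value_counts leaves out missing values by default, so the shares are relative to the rows where the field is filled in.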

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

The trip duration and age distributions were right-skewed, with a long tail caused by a few extremely high values, so most observations were lumped together. I had to use a log transform to better visualize the histogram.
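
One way to apply a log transform to a histogram is to use bin edges that are evenly spaced on a log10 scale. The sketch below uses made-up durations and an assumed resolution of 4 bins per decade; it is not necessarily the exact binning used for the plots in this report.

```python
import numpy as np

# Made-up durations in seconds, including one extreme value.
duration_sec = np.array([61, 300, 600, 650, 900, 2000, 70000])

# Span whole decades so that every observation falls inside the bins.
low = np.floor(np.log10(duration_sec.min()))   # 1.0, i.e. 10 seconds
high = np.ceil(np.log10(duration_sec.max()))   # 5.0, i.e. 100000 seconds

# Assumption: 4 bins per decade, evenly spaced in log10 space.
bins_per_decade = 4
n_edges = int((high - low) * bins_per_decade) + 1
log_edges = 10 ** np.linspace(low, high, n_edges)

# Histogram counts on the log-spaced bins.
counts, _ = np.histogram(duration_sec, bins=log_edges)
print(log_edges.round(1))
print(counts)
```

When plotting, the same edges can be passed as the bins of a histogram together with a log-scaled x-axis, so that the bars appear equally wide.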

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Most of my visualizations across time needed features extracted from the original time columns (start_time and end_time). From these 2 columns, I extracted the start_period, month of trip, day of trip and period of trip, which helped to provide further insight into the data. Similarly, from the member_birth_year column, I was able to calculate the age of Ford GoBike riders.
I noticed that there were riders above 100 years old, as well as trips that lasted almost a day.
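
The extraction described above could look roughly like the sketch below. The rows are made up, the cut-off hours for morning, afternoon and night are my assumption (the exact boundaries are not stated in this report), and age is taken relative to the year of the trip.

```python
import pandas as pd

# Made-up rows for illustration only.
trips = pd.DataFrame({
    "start_time": pd.to_datetime([
        "2019-02-04 08:10", "2019-02-09 14:30", "2019-02-10 21:05",
    ]),
    "member_birth_year": [1985.0, 1990.0, 1900.0],
})

def start_period(hour: int) -> str:
    """Map an hour of day to a period (assumed cut-offs: 12 and 18)."""
    if hour < 12:
        return "morning"
    if hour < 18:
        return "afternoon"
    return "night"

# Time features derived from the start_time column.
trips["start_month"] = trips["start_time"].dt.month_name()
trips["start_day"] = trips["start_time"].dt.day_name()
trips["start_period"] = trips["start_time"].dt.hour.map(start_period)

# Age in years at the time of the trip.
trips["age"] = trips["start_time"].dt.year - trips["member_birth_year"]

# Riders whose recorded age is above 100 deserve a closer look.
implausible_age = trips[trips["age"] > 100]

print(trips)
print(len(implausible_age))
```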

Bivariate Exploration

I will explore the relationship between duration and other variables such as age, day of the week, period of the day, gender and user type.

A correlation of 0.006 shows no linear correlation between the age of riders and duration in seconds. The scatter plot also shows no definite linear relationship between the 2 variables. This is a surprising insight, as I expected age and duration to be negatively correlated: as age increases, trip duration should reduce.
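
For reference, a correlation coefficient like this can be computed with pandas, whose Series.corr uses the Pearson method by default. The numbers below are made up and chosen so that the coefficient comes out at zero; they are not taken from the dataset.

```python
import pandas as pd

# Made-up ages and durations with no linear association.
riders = pd.DataFrame({
    "age": [25, 30, 35, 40, 45],
    "duration_sec": [600, 500, 700, 500, 600],
})

# Pearson correlation coefficient between age and trip duration.
r = riders["age"].corr(riders["duration_sec"])
print(round(r, 3))
```

A value near zero only rules out a linear relationship. A scatter plot is still useful for spotting any non-linear pattern.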

The high variation in the dataset prevents a proper view of the durations for various days of the week.

We can deduce that rides take longer on weekends than on weekdays. One reason could be that riders use Ford GoBike to commute to work on weekdays, while on weekends it's more for exercise and leisure.

The duration distribution differs across genders. Both the female and other genders take longer trips than the male gender.

There's little difference in trip duration across starting periods. Trips in the afternoon are slightly longer than in other periods, but the difference is minimal.

The visual above shows that trips taken by subscribers reduce as the day progresses: they were highest in the morning, lower in the afternoon and lowest at night. Trips by the customer user_type, in contrast, peaked in the afternoon and reduced at night.
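
The numbers behind a comparison like this can be tabulated with a row-normalized cross-tabulation. The sketch below uses made-up rows and assumes a start_period column has already been derived.

```python
import pandas as pd

# Made-up rows for illustration only.
trips = pd.DataFrame({
    "user_type": ["Subscriber"] * 4 + ["Customer"] * 4,
    "start_period": [
        "morning", "morning", "afternoon", "night",
        "morning", "afternoon", "afternoon", "night",
    ],
})

# Share of each user type's trips per starting period (each row sums to 1).
period_share = pd.crosstab(
    trips["user_type"], trips["start_period"], normalize="index"
)

# Period with the largest share for each user type.
busiest_period = period_share.idxmax(axis=1)

print(period_share)
print(busiest_period)
```

Normalizing by row matters here because subscribers far outnumber customers, so raw counts would hide the customers' pattern.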

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

There's no correlation between the duration of a trip and the age of the rider. I expected a slight negative relationship, as older riders would be expected to take shorter trips, but this isn't altogether surprising. From the age distribution, we understand that most riders are between 20 and 50, with little representation from younger and older people.

Longer trips are more common on weekends than on weekdays.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I expected longer trip durations from the male gender. Surprisingly, male riders took shorter trips than the female and other gender classifications.

Also, I noticed that the customer user_type took longer trips than subscribers. This could be because customers probably needed the bikes for a longer one-off journey, unlike subscribers, who already have a mapped-out route.

Multivariate Analysis

Question 1 : Understand user_type across duration and day of the trip

I previously noticed that customers went on longer trips than subscribers. I decided to check whether those trips were specific to some days. Our visualization shows that they were not. In fact, on every day of the week, customers went on longer trips than subscribers. On average, a customer would spend about 1200 seconds on a trip, compared to about 700 seconds for a subscriber. On weekends, however, trips are longer for both user types.
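
A comparison like this can be checked numerically by grouping on day and user type. The sketch below uses made-up durations and assumes a start_day column has already been derived; the figures are illustrative and are not the dataset's averages.

```python
import pandas as pd

# Made-up rows for illustration only.
trips = pd.DataFrame({
    "user_type": ["Customer", "Customer", "Subscriber",
                  "Subscriber", "Customer", "Subscriber"],
    "start_day": ["Monday", "Saturday", "Monday",
                  "Saturday", "Monday", "Monday"],
    "duration_sec": [1000, 1500, 600, 900, 1200, 700],
})

# Mean trip duration per day, with one column per user type.
mean_duration = (
    trips.groupby(["start_day", "user_type"])["duration_sec"]
    .mean()
    .unstack("user_type")
)

# True if customers have the longer average trip on every day.
customers_longer_every_day = (
    mean_duration["Customer"] > mean_duration["Subscriber"]
).all()

print(mean_duration)
print(customers_longer_every_day)
```

Because duration has extreme outliers, swapping mean() for median() is a reasonable robustness check.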

Question 2 : I would like to understand why females and other genders take longer trips than males

The gender classified as other has the longest trip duration on all days. The female gender also took longer trips than males on every day of the week in the month of February. It would be interesting to know the exact cause of this.

Insights

Females moved more in the morning than in any other period. Most trips occurred at 3 stations: Market St at 10th St, San Francisco Caltrain Station 2 and Berry St at 4th St.

For the other gender, most trips similarly occurred in the morning, but with little difference from the afternoon. Stations with the most trips: Market St at 10th St, San Francisco Caltrain Station 2 and San Francisco Caltrain Station.

Males had more trips at Market St at 10th St, Powell St BART Station and Montgomery St BART Station. Apart from the difference in the 3 most used stations, we can see that males moved more in the afternoon than in the morning, which is probably why they take shorter trips. Also, the location of San Francisco Caltrain Station 2, the busiest station, could be a reason for the longer trips by females and other genders. Males had fewer trips from that station.
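
The per-gender station ranking can be sketched as below. The station names are placeholders and the rows are made up; only the column names come from the dataset.

```python
import pandas as pd

# Made-up rows with placeholder station names.
trips = pd.DataFrame({
    "member_gender": ["Female", "Female", "Female", "Male", "Male", "Male"],
    "start_station_name": ["Station A", "Station A", "Station B",
                           "Station C", "Station C", "Station A"],
})

# Trips per start station within each gender, most used first,
# keeping only the top 2 stations for each gender.
top_stations = (
    trips.groupby("member_gender")["start_station_name"]
    .value_counts()
    .groupby(level="member_gender")
    .head(2)
)

print(top_stations)
```

Using head(3) instead would give the 3 most used stations per gender, matching the lists above.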